Softmax regression

Also called multinomial logistic regression, softmax regression is a generalization of logistic regression to the case where we want to handle multiple classes.

Specification

Hypothesis:
(design matrix $X$: $m \times d$; $\theta = [\theta^{(1)} \,\cdots\, \theta^{(K)}]$: $d \times K$; below, $x$ denotes a single $d$-dimensional example)

$$
h_\theta(x) =
\begin{bmatrix}
P(y=1 \mid x; \theta) \\
P(y=2 \mid x; \theta) \\
\vdots \\
P(y=K \mid x; \theta)
\end{bmatrix}
=
\frac{1}{\sum_{j=1}^{K} \exp(\theta^{(j)\top} x)}
\begin{bmatrix}
\exp(\theta^{(1)\top} x) \\
\exp(\theta^{(2)\top} x) \\
\vdots \\
\exp(\theta^{(K)\top} x)
\end{bmatrix}
$$
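
A minimal NumPy sketch of the hypothesis for a single example (the function name `softmax_hypothesis` is mine; subtracting the max is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

def softmax_hypothesis(x, theta):
    """Class probabilities h_theta(x) for a single example.

    x     : (d,)   feature vector
    theta : (d, K) one parameter column theta^{(j)} per class
    """
    logits = theta.T @ x                  # theta^{(j)T} x for j = 1..K
    logits -= logits.max()                # shift for stability; ratios are unchanged
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()  # probabilities summing to 1
```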

Cost function (cross entropy):
$$
J(\theta) = -\sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)\top} x^{(i)})}
$$
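
A corresponding cost computation might look like the sketch below (labels are assumed to be coded 0..K−1 rather than 1..K; the double sum collapses because the indicator selects only the true class for each example):

```python
import numpy as np

def cost(theta, X, y):
    """Cross-entropy cost J(theta).

    theta : (d, K) parameters
    X     : (m, d) design matrix
    y     : (m,)   integer labels coded 0..K-1
    """
    logits = X @ theta                               # theta^{(j)T} x^{(i)} for all i, j
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the indicator 1{y^{(i)} = k} picks out the log-probability of the true class
    return -log_probs[np.arange(X.shape[0]), y].sum()
```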

Optimize with gradient descent:
$$
\nabla_{\theta^{(k)}} J(\theta) = -\sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{y^{(i)} = k\} - P(y^{(i)} = k \mid x^{(i)}; \theta) \right) \right]
$$
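
A matching gradient sketch, stacking the per-class gradients into a $d \times K$ matrix (function name and the learning rate `alpha` are illustrative, not from the source):

```python
import numpy as np

def gradient(theta, X, y):
    """Gradient of J(theta) with respect to theta, shape (d, K)."""
    m = X.shape[0]
    logits = X @ theta
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)        # P(y^{(i)} = k | x^{(i)}; theta)
    onehot = np.zeros_like(probs)
    onehot[np.arange(m), y] = 1.0                    # 1{y^{(i)} = k}
    return -X.T @ (onehot - probs)                   # column k is the formula above

# a single gradient-descent update:
#   theta -= alpha * gradient(theta, X, y)
```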

Redundancy of Parameters

The actual number of parameters is only $(K-1) \cdot d$ rather than $K \cdot d$, because the probabilities always sum to 1. In fact, subtracting the same vector $\psi$ from every $\theta^{(j)}$ does not affect the hypothesis' predictions at all:

$$
\begin{aligned}
P(y^{(i)} = k \mid x^{(i)}; \theta)
&= \frac{\exp\left((\theta^{(k)} - \psi)^\top x^{(i)}\right)}{\sum_{j=1}^{K} \exp\left((\theta^{(j)} - \psi)^\top x^{(i)}\right)} \\
&= \frac{\exp(\theta^{(k)\top} x^{(i)}) \exp(-\psi^\top x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)\top} x^{(i)}) \exp(-\psi^\top x^{(i)})} \\
&= \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)\top} x^{(i)})}.
\end{aligned}
$$

So one can instead set $\theta^{(K)} = \vec{0}$ and optimize only with respect to the remaining parameters.
Note that J(θ) is still convex, but the Hessian is singular, which causes a straightforward implementation of Newton’s method to run into numerical problems.
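
A quick numerical check of this shift invariance (the dimensions and the vector $\psi$ are arbitrary; the helper `probs` mirrors the hypothesis sketch given earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 5, 3
x = rng.normal(size=d)
theta = rng.normal(size=(d, K))
psi = rng.normal(size=d)

def probs(theta):
    e = np.exp(theta.T @ x)
    return e / e.sum()

# subtracting psi from every column of theta leaves the predictions unchanged
print(np.allclose(probs(theta), probs(theta - psi[:, None])))   # True
```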

Reference

http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/